Every "cut your LLM costs" tutorial says the same thing: send fewer input tokens. Trim the system prompt, summarise the history, shrink the context. It feels obvious, and it is mostly wrong. Raw input-token count is the weakest lever on the board, and optimising it first is how teams spend a week saving 8% while the real cost sits untouched.

The bill is set by a mix of three token types priced very differently. Once you see the mix, a counterintuitive move follows: sending more input tokens, in the right shape, routinely lowers the total cost of a request. Here is the mechanism, with the retrieval pipeline I use to exploit it.

1. Three token types, three prices

A single LLM call is not billed at one rate. There are at least three, and they are not close to each other:

1x: Fresh input tokens. The base rate, and the one everyone obsesses over.

3-5x: Output tokens. Generation is where the money goes (Claude ~5x, GPT-4o ~4x the input rate).

0.1-0.5x: Cached input tokens. Anthropic reads at 0.1x (90% off), OpenAI ~0.5x, Gemini ~0.25x.

Two consequences fall straight out of this table. First, at a 0.1x cached rate, one output token costs as much as thirty to fifty cached input tokens. If you are going to optimise anything, optimise the length of what the model writes, not what it reads. Second, input tokens are not a fixed price: a token you can serve from cache is almost free. "Input tokens" is not one number, it is two, and the gap between them can be a full order of magnitude.

The fallacy, stated plainly: treating "tokens" as a single quantity to minimise. The metric that maps to your invoice is the weighted mix: (fresh_in x 1) + (cached_in x 0.1) + (out x 5). Minimising raw token count optimises the term with the smallest coefficient.
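As a rough sketch of that weighted mix, the function below prices a request in units of one fresh input token. The default weights are the illustrative coefficients from the formula above (1, 0.1, 5); the function name and the two sample requests are invented for this example, so substitute your provider's actual ratios.

```python
def weighted_cost(fresh_in: int, cached_in: int, out: int,
                  w_fresh: float = 1.0, w_cached: float = 0.1,
                  w_out: float = 5.0) -> float:
    """Cost of one request, expressed in 'fresh input token' equivalents."""
    return fresh_in * w_fresh + cached_in * w_cached + out * w_out


# Two hypothetical requests: A sends fewer raw tokens, B sends more.
request_a = {"fresh_in": 3000, "cached_in": 0, "out": 600}     # 3,600 raw tokens
request_b = {"fresh_in": 500, "cached_in": 4000, "out": 300}   # 4,800 raw tokens

cost_a = weighted_cost(**request_a)
cost_b = weighted_cost(**request_b)
print(f"A: {sum(request_a.values())} raw tokens, weighted cost {cost_a:.0f}")
print(f"B: {sum(request_b.values())} raw tokens, weighted cost {cost_b:.0f}")
```

With these made-up numbers, the request with more raw tokens comes out cheaper, which is the whole point of pricing the mix rather than the count.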

2. Where the money actually leaks: output tokens

Here is the part most cost guides miss. The shape of your input controls the length of your output. Feed a model a messy, unranked context dump and it does not just answer. It orients itself out loud.

You have seen the symptom in the generated text: "Based on the provided documents, it appears that several sources discuss... Document 3 seems most relevant, although Document 1 also touches on... Let me synthesise these..." That is the model spending expensive output tokens narrating its own search through context you handed it in a bad order. Noisy, contradictory, or unranked chunks make this worse, because the model hedges, restates, and re-derives instead of answering.

Naive RAG, which embeds documents and stuffs the top-k cosine matches into the prompt, is a machine for producing exactly this. The top cosine matches are not the most useful chunks, only the most superficially similar, so the model gets a pile it has to sort through at 5x the input price.

3. The quieter leak: cache misses

Prompt caching only fires on a stable prefix. The provider hashes the leading tokens of your request and reuses the computed attention state if the next request starts identically. Change a single byte in that prefix and everything from that point onward is billed at the fresh rate again.
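To make the byte-sensitivity concrete, here is a minimal sketch that fingerprints a prompt prefix with SHA-256. It only illustrates the idea of exact-prefix matching. It is not how any particular provider computes its cache keys, and `prefix_fingerprint` and the sample prompts are hypothetical.

```python
import hashlib


def prefix_fingerprint(prompt: str, prefix_chars: int) -> str:
    """Hash the first `prefix_chars` characters of a prompt (illustrative cache key)."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


system = "You are a support assistant. Answer from the numbered context only.\n"
prompt_1 = system + "Question: how do I rotate an API key?"
prompt_2 = system + "Question: what is the refund window?"
# Same wording, but with one extra space inserted inside the system prompt.
prompt_3 = system.replace("assistant.", "assistant. ") + "Question: how do I rotate an API key?"

n = len(system)
same_prefix = prefix_fingerprint(prompt_1, n) == prefix_fingerprint(prompt_2, n)
broken_prefix = prefix_fingerprint(prompt_1, n) == prefix_fingerprint(prompt_3, n)
print(same_prefix)    # identical leading text, so the fingerprints match
print(broken_prefix)  # one extra space in the prefix changes the fingerprint
```

The practical takeaway is to keep anything that varies per request (the user question, timestamps, request IDs) after the stable part of the prompt, not inside it.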

This is where people misread caching, so let me be precise about what it does and does not buy you in RAG:

So caching rewards determinism. A retrieval stage that emits the same ranked payload the same way every time is cache-friendly. One that returns chunks in nondeterministic order is quietly paying the miss penalty on its scaffold and across every agent turn.

4. The fix: spend input tokens to buy back output tokens

This is the inversion. Instead of minimising the context, you invest in making it structured, ranked, and deterministic, then let that slightly larger but cleaner payload collapse the output and stabilise the prefix. The retrieval pipeline that does it has three stages.

First, hybrid retrieval: run dense vector search and BM25 keyword search in parallel, then fuse them with Reciprocal Rank Fusion. RRF needs no score normalisation because only rank positions matter, so a chunk that both methods rank highly floats to the top. This is the actual fusion code from my engine:

Python: Reciprocal Rank Fusion (from rag-knowledge-engine/retriever.py)

```python
@staticmethod
def _rrf(vec_hits, bm25_hits, rrf_k=60):
    """Fuse two ranked lists via RRF. A doc in both lists scores higher."""
    def key(doc):
        return (doc["file"], doc["chunk"])

    rrf_scores = {}
    for rank, doc in enumerate(vec_hits):
        k = key(doc)
        rrf_scores[k] = rrf_scores.get(k, 0.0) + 1.0 / (rrf_k + rank + 1)
    for rank, doc in enumerate(bm25_hits):
        k = key(doc)
        rrf_scores[k] = rrf_scores.get(k, 0.0) + 1.0 / (rrf_k + rank + 1)

    seen = {}
    for doc in vec_hits + bm25_hits:
        seen.setdefault(key(doc), doc)
    return sorted(seen.values(), key=lambda d: rrf_scores.get(key(d), 0.0), reverse=True)
```
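For a feel of how the fusion behaves, here is a small standalone sketch that applies the same scoring rule to toy result lists. `rrf_fuse` is a module-level restatement written for this example, not the engine's method, and the file names and chunk indices are invented.

```python
def rrf_fuse(vec_hits, bm25_hits, rrf_k=60):
    """Standalone restatement of RRF: sum 1 / (rrf_k + rank) over both lists."""
    scores, first_seen = {}, {}
    for hits in (vec_hits, bm25_hits):
        for rank, doc in enumerate(hits, start=1):  # ranks are 1-based
            key = (doc["file"], doc["chunk"])
            scores[key] = scores.get(key, 0.0) + 1.0 / (rrf_k + rank)
            first_seen.setdefault(key, doc)
    return sorted(first_seen.values(),
                  key=lambda d: scores[(d["file"], d["chunk"])],
                  reverse=True)


# Invented toy hits: auth.md chunk 2 appears in both lists.
vec_hits = [{"file": "intro.md", "chunk": 0}, {"file": "auth.md", "chunk": 2}]
bm25_hits = [{"file": "auth.md", "chunk": 2}, {"file": "faq.md", "chunk": 5}]

fused = rrf_fuse(vec_hits, bm25_hits)
fused_keys = [(d["file"], d["chunk"]) for d in fused]
print(fused_keys)
```

The chunk that appears in both lists scores 1/62 + 1/61 and lands first, ahead of the chunk that was ranked first by vector search alone (1/61).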

Second, cross-encoder reranking: take the fused top-50 candidate pool and rescore it with a cross-encoder that reads the query and each passage together. A bi-encoder embeds query and passage separately and compares vectors; a cross-encoder judges them jointly and is far more accurate about actual relevance. You run it only on the 50-candidate pool, so the latency is bounded, and you keep the top 5.
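The rerank step can be sketched as follows, assuming a `score_pair(query, passage)` callable that returns a relevance score. That callable, the `rerank` function, and the toy word-overlap scorer are all stand-ins so the example runs without a model; in practice you would plug a real cross-encoder in at that point.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float],
           pool_size: int = 50, keep: int = 5) -> list[str]:
    """Rescore at most `pool_size` candidates jointly with the query, keep the best `keep`."""
    pool = candidates[:pool_size]  # bound latency: never score more than the pool
    scored = [(score_pair(query, passage), idx, passage)
              for idx, passage in enumerate(pool)]
    # Sort by score descending; the original index breaks ties deterministically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)


candidates = [
    "Billing cycles start on the first of the month.",
    "To rotate an API key, open Settings and choose Regenerate.",
    "API rate limits apply per key.",
]
top = rerank("how do I rotate an API key", candidates, toy_overlap_score, keep=2)
print(top)
```

Passing the scorer in as a parameter keeps the pool-size and top-k logic testable on its own, which fits a suite that runs without live services.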

Third, deterministic formatting: emit those 5 chunks in rank order, numbered, in a fixed template, every time. Same query shape, same bytes.
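A minimal sketch of what deterministic formatting can look like is below. The `file` and `chunk` fields match the fusion code earlier in this post; the `score` and `text` fields, the `format_context` name, and the exact template are assumptions made for illustration.

```python
def format_context(chunks: list[dict]) -> str:
    """Render ranked chunks in a fixed, numbered template so equal inputs give equal bytes."""
    # Sort by score descending, then by (file, chunk) so ties never reorder between runs.
    ordered = sorted(chunks, key=lambda c: (-c["score"], c["file"], c["chunk"]))
    blocks = []
    for position, c in enumerate(ordered, start=1):
        text = " ".join(c["text"].split())  # collapse stray whitespace
        blocks.append(f"[{position}] source={c['file']}#{c['chunk']}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Invented sample chunks; the first two share a score to exercise the tie-break.
chunks = [
    {"file": "auth.md", "chunk": 2, "score": 0.91, "text": "Rotate keys under  Settings."},
    {"file": "faq.md", "chunk": 5, "score": 0.91, "text": "Keys expire after 90 days."},
    {"file": "intro.md", "chunk": 0, "score": 0.40, "text": "Welcome to the docs."},
]
payload_a = format_context(chunks)
payload_b = format_context(list(reversed(chunks)))
print(payload_a == payload_b)  # input order does not change the rendered bytes
```

The explicit tie-break matters more than it looks: two chunks with equal scores are exactly the case where an unstable ordering silently changes the payload between otherwise identical requests.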

The payload that reaches the model might be 20% larger than a naive top-5 cosine dump, because reranking lets you confidently include a couple more high-value chunks. But it is clean and ordered. The model stops narrating its orientation and answers directly. Output tokens fall, and the deterministic block caches across agent turns.

5. The math, worked

Take a representative 2026 pricing point of $3 per million input tokens, $15 per million output (5x), and cached reads at $0.30 per million (0.1x). Compare one query under each approach. These numbers are illustrative, not a benchmark of your workload, but the direction is the point.

Cost per query: naive vs structured

```
NAIVE: 4,000 fresh input + 800 output (model hedges through unranked chunks)
  = 4000 * $3/M + 800 * $15/M
  = $0.0120 + $0.0120
  = $0.0240

STRUCTURED: 4,800 input (+20%), but 4,000 is a stable cached prefix,
and clean ranking cuts output to 350 tokens
  = (4000 cached * $0.30/M) + (800 fresh * $3/M) + (350 out * $15/M)
  = $0.0012 + $0.0024 + $0.00525
  = $0.00885

Result: ~63% cheaper, while sending MORE input tokens.
```

The structured request processes more input and still costs a little over a third as much (about 37%), because it moved spend off the two expensive coefficients (output and fresh input) and onto the cheap one (cached input). That is the whole trick. You did not minimise tokens. You re-balanced the mix.
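The comparison above can be reproduced with a few lines of Python, which also makes it easy to rerun with your own rates and token counts. The default rates are the illustrative prices used in this section, and `request_cost_usd` is a name chosen for this sketch.

```python
def request_cost_usd(fresh_in, cached_in, out,
                     in_rate=3.00, cached_rate=0.30, out_rate=15.00):
    """Dollar cost of one request; all rates are USD per million tokens."""
    return (fresh_in * in_rate + cached_in * cached_rate + out * out_rate) / 1_000_000


# Token counts taken from the naive and structured cases above.
naive = request_cost_usd(fresh_in=4000, cached_in=0, out=800)
structured = request_cost_usd(fresh_in=800, cached_in=4000, out=350)
saving = 1 - structured / naive

print(f"naive:      ${naive:.5f}")
print(f"structured: ${structured:.5f}")
print(f"saving:     {saving:.0%}")
```

Setting `cached_in=0` and moving those 4,000 tokens into `fresh_in` for the structured case is a quick way to see how much of the saving comes from the shorter output alone versus the cache hit.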

The rule that replaces "send fewer tokens": minimise output tokens first (rank and clean the context so the model answers instead of narrating). Make your prefix byte-stable so it caches. Only then trim raw input, and even then, gladly spend input tokens if they buy back output tokens at the 5x rate.

6. When this does not apply

Honesty about the boundaries, because this is not a universal law:

What I Built

The pipeline above is the RAG Knowledge Engine: hybrid BM25 + vector retrieval fused with RRF, a cross-encoder reranker over the top-50 pool, deterministic context formatting, and a RAGAS-style evaluator. 25 tests, all mocked, so the suite runs with no live services. The same retrieval shape powers the grounded chatbot in the corner of this site. The cost behaviour described here is the reason it is built this way, not a naive top-k dump.

If you take one thing from this: stop counting tokens, and start pricing the mix.